This is a take-home and an open book final. Helping or asking help from others is considered plagiarism.
1 Q1. (25 pts) Survival data analysis
Consider following survival times of 25 patients with no history of chronic diesease (chr = 0) and 25 patients with history of chronic disease (chr = 1).
Manually fill in the missing information in the following tables of ordered failure times for groups 1 (chr = 0) and 2 (chr = 1). Explain how survival probabilities (last column) are calculated.
Answer:
Group 1 (chr = 0):
time
n.risk
n.event
survival
1.8
25
1
0.96
2.2
24
1
0.92
2.5
23
1
0.88
2.6
22
1
0.84
3.0
21
1
0.80
3.5
20
1
0.76
3.8
19
1
0.72
5.3
18
1
0.68
5.4
17
1
0.64
5.7
16
1
0.60
6.6
15
1
0.56
8.2
14
1
0.52
8.7
13
1
0.48
9.2
12
2
0.40
9.8
10
1
0.36
10.0
9
1
0.32
10.2
8
1
0.28
10.7
7
1
0.24
11.0
6
1
0.20
11.1
5
1
0.16
11.7
4
4
0.00
Group 2 (chr = 1):
time
n.risk
n.event
survival
1.4
25
1
0.96
1.6
24
1
0.92
1.8
23
1
0.88
2.4
22
1
0.84
2.8
21
1
0.80
2.9
20
1
0.76
3.1
19
1
0.72
3.5
18
1
0.68
3.6
17
1
0.64
3.9
16
1
0.60
4.1
15
1
0.56
4.2
14
1
0.52
4.7
13
1
0.48
4.9
12
1
0.44
5.2
11
1
0.40
5.8
10
1
0.36
5.9
9
1
0.32
6.5
8
1
0.28
7.8
7
1
0.24
8.3
6
1
0.20
8.4
5
1
0.16
8.8
4
1
0.12
9.1
3
1
0.08
9.9
2
1
0.04
11.4
1
1
0.00
The empirical survivor function, an estimate of the probability of survival beyond time \(y\), is \[
\widehat{S}(y) = \widehat{\mathbb{P}}(Y > y).
\] Arrange the even times \(y_{(1)} \le y_{(2)} \le \cdots \le y_{(k)}\). Kaplan-Meier formula or product limit formula\[\begin{eqnarray*}
\widehat{S}(y_{k}) &=& \widehat{\mathbb{P}}(Y > y_{(k)}) \\
&=& \widehat{\mathbb{P}}(Y > y_{(k-1)}) \times \widehat{\mathbb{P}}(Y > y_{(k)} \mid Y > y_{(k-1)}) \\
&=& \widehat{S}(y_{k-1}) \times \widehat{\mathbb{P}}(Y > y_{(k)} \mid Y > y_{(k-1)}) \\
&=& \cdots \\
&=& \prod_{j=1}^k \widehat{\mathbb{P}}(Y > y_{(j)} \mid Y > y_{(j-1)}) \\
&=& \prod_{j=1}^k \left( \frac{n_j - d_j}{n_j} \right),
\end{eqnarray*}\] where \[\begin{eqnarray*}
n_j &=& \text{number of subjects who are alive just before time $y_{(j)}$}, \\
d_j &=& \text{number of deaths that occur at time $y_{(j)}$}.
\end{eqnarray*}\] Here we do not have cencored data so the survival probabilities in each row are calculated as \(\frac{n.risk - n.event}{total\text{ }number}\). For example, the survival probability at time 3.5 for group 1 is calculated as \(\frac{20 - 1}{25} = 0.76\).
Use R to display the Kaplan-Meier survival curves for groups 1 (chr = 0) and 2 (chr = 1).
Write down the log-likelihood of the parametric exponential (proportional hazard) model for survival times. Explain why this model can be fit as a generalized linear model with offset.
Answer: - A datum of a survival data set is \((y_j, \delta_j, \mathbf{x}_j)\) where
- \(y_j\) is the survival time,
- \(\delta_j\) is the censoring indicator with \(\delta_j = 1\) if the survival time is uncensored and \(\delta_j=0\) if it is censored, and
- \(\mathbf{x}_j\) is the predictor vector.
Let \(y_1, \ldots, y_r\) be the uncensored observations and \(y_{r+1}, \ldots, y_n\) the censored ones. The likelihood is \[
L = \prod_{j=1}^r f(y_j) \times \prod_{j=r+1}^n S(y_j) = \prod_{j=1}^n f(y_j)^{\delta_j} S(y_j)^{1 - \delta_j}
\] so the log-likelihood is \[
\ell = \sum_{i=1}^n [\delta_j \log f(y_j) + (1 - \delta_j) \log S(y_j)] = \sum_{j=1}^n [\delta_j \log h(y_j) + \log S(y_j)]
\] , where \(h(y_j) = f(y_j) / S(y_j)\) is the hazard function and \(S(y_j) = 1-F(y_j)\) is the survival function.
The log-likelihood for the exponential model is \[
\ell(\boldsymbol{\theta}) = \sum_j [\delta_j \log h(y_j) + \log S(y_j)] = \sum_j (\delta_j \log \theta_j - \theta_j y_j),
\] since the survival function is \(S(y_j) = e^{-\theta_j y_j}\) and the hazard function is \(h(y_j) = \theta_j\) for the exponential model. For poisson regression, the log-likelihood is \[
\ell(\boldsymbol{\beta}) = \sum_i y_i \log \mu_i - \mu_i - \log y_i!
\] If treating \(\delta_j\) as Poisson\((\mu_j)\) where \(\mu_j = \theta_j y_j\) then the log-likelihood for the poisson model is \[
\ell(\boldsymbol{\theta}) = \sum_j y_j \log (\theta_j y_j) - \theta_j y_j - \log y_j!
\] The log-likehood for the exponential model is \[
\ell(\boldsymbol{\theta}) = \sum_j (\delta_j \log (\theta_j y_j) - \theta_j y_j- \delta_j \log y_j)
\] Only the survival times \(y_j\) are included in the model as an offset term \(\log y_j\). And \(\log \mu_j = \log (\theta_j y_j) = \mathbf{x}_j^T \boldsymbol{\beta} + \log(y_j)\). So the exponential model can be fit as a generalized linear model with offset.
Fit the exponential (proportional hazard) model on the chr data using R. Interpret the coefficients.
Answer:
emod <-survreg(Surv(time, cen) ~ chr, dist ="exponential", data = chr_data)summary(emod)
Call:
survreg(formula = Surv(time, cen) ~ chr, data = chr_data, dist = "exponential")
Value Std. Error z p
(Intercept) 2.014 0.200 10.07 <2e-16
chr -0.350 0.283 -1.24 0.22
Scale fixed at 1
Exponential distribution
Loglik(model)= -141.9 Loglik(intercept only)= -142.7
Chisq= 1.52 on 1 degrees of freedom, p= 0.22
Number of Newton-Raphson Iterations: 4
n= 50
Coefficient for intercept = -0.350: When patients do not have history of chronic disease, the hazard rate is \(e^{-0.350} = 0.705\).
Coefficient for chr = 2.014: When patients have history of chronic disease, the hazard rate is \(e^{2.014} = 7.49\) times higher than those who do not have history of chronic disease.
Comment on the limitation of exponential model compared to other more flexible models such as Weibull.
Answer: The exponential model assumes that the hazard rate is constant over time. This may not be realistic in many cases. The Weibull model is more flexible than the exponential model because it allows the hazard rate to change over time.
2 Q2 (25 pts). Longitudinal data analysis
Onychomycosis, popularly known as toenail fungus, is a fairly common condition that not only can disfigure and sometimes destroy the nail but that also can lead to social and self-image issues for sufferers. Tight-fitting shoes or hosiery, the sharing of common facilities such as showers and locker rooms, and toenail polish are all thought to be implicated in the development of onychomycosis. This question relates to data from a study conducted by researchers that recruited sufferers of a particular type of onychomycosis, dermatophyte onychomycosis. The study conducted by the researchers was focused on comparison of two oral medications, terbinafine (given as 250 mg/day, denoted as treatment 1 below) and itraconazole (given as 200 mg/day, denoted as treatment 2 below).
The trial was conducted as follows. 200 sufferers of advanced toenail dermatophyte onychomycosis in the big toe were recruited, and each saw a physician, who removed the afflicted nail. Each subject was then randomly assigned to treatment with either terbinafine (treatment 1) or itraconazole (treatment 2). Immediately prior to beginning treatment, the length of the unafflicted part of the toenail (which was hence not removed) was recorded (in millimeters). Then at 1 month, 2 months, 3 months, 6 months, and 12 months, each subject returned, and the length of the unafflicted part of the nail was measured again. A longer unafflicted nail length is a better outcome. Also recorded on each subject was gender and an indicator of the frequency with which the subject visited a gym or health club (and hence might use shared locker rooms and/or showers).
The data are available in the file toenail.txt from here. The data are presented in the form of one data record per observation; the columns of the data set are as follows:
Subject id
Health club frequency indicator (= 0 if once a week or less, = 1 if more than once a week)
Gender indicator (= 0 if female, = 1 if male)
Month
Unafflicted nail length (the response, mm)
Treatment indicator (= 1 if terbinafine, = 2 if itraconazole)
The researchers had several questions, which they stated to you as follows:
Use the linear mixed effect model (LMM) to answer: Is there a difference in the pattern of change of lengths of the unafflicted part of the nail between subjects receiving terbinafine and itraconazole over a 12 month period? Does one treatment show results more quickly?
Plot the change of lengths of the unafflicted part of the nail over time and separated by treatment groups. Comment on overall patterns over time.
Based on the pattern observed, pick appropriate time trend in the LMM and provide an algebraic definition for your chosen LMM, e.g., is the linear trend model adequate? or quadratic trend is needed? or any other pattern is more approriate? justify your answer.
Model the covariance: fit both random intercept and random slope model and determine which one fits the data better.
library(ggplot2)ggplot(toenail, aes(x = month, y = length, group = id, color = treatment)) +geom_line() +facet_wrap(~treatment) +labs(x ="Month", y ="Length (mm)")
It seems that the length of the unafflicted part of the nail increases over time for both treatments. The increase seems to be faster for itraconazole than terbinafine.
From the plots above, it seems that the linear trend model is adequate. The length of the unafflicted part of the nail seems to increase linearly over time for both treatments. For the analysis part, we only include treatment and month and control for health and gender. We can also use the Kenward-Roger adjusted F-test to test the fixed effects to see if the linear trend model is adequate.
The p-value is 0.3992, so the quadratic trend model is not significantly better than the linear trend model. So the linear trend model is adequate. Next we test the interaction between treatment and month.
The p-value is 0.002617 so the interaction between treatment and month is significant. We then consider the linear mixed model with random intercepts and random slopes: \[
\text{length}_{ij} = \mu + \beta_1 \text{month}_{i} + \beta_2 \text{treatment}_{j} + \beta_3 \text{month}_{i} \times \text{treatment}_{j} + b_{0j} + b_{1j} \text{month}_{i} + \epsilon_{ij},
\] where \(i\) indexes month and \(j\) indexes individual. The random intercepts and slopes are iid \[
\begin{pmatrix}
b_{0,j} \\
b_{1,j}
\end{pmatrix} \sim N(\mathbf{0}, \boldsymbol{\Sigma}).
\] and the noise terms \(\epsilon_{ij}\) are iid \(N(0, \sigma^2)\).
sig01 is the variance of the random intercepts, sig02 is the covariance between the random intercepts and slopes, and sig03 is the variance of the random slopes. Both random intercepts and slopes are significant because their intervals do not include 0. The itraconazole group is not significantly different from the terbinafine group over a 12 month period because the confidence interval includes 0.
Use the linear mixed effect model (LMM) to answer: Is there an association between the pattern of change of nail lengths and gender and/or health club frequency in subjects taking terbinafine? This might indicate that this drug brings about relief more swiftly in some kinds of subject versus others.
Provide graphs to show patterns the change of nail lengths and gender and/or health club frequency in subjects taking terbinafine.
Based on the pattern observed from question 1, pick appropriate time trend in the LMM and provide an algebraic definition for your chosen LMM, e.g., is the linear trend model adequate? or quadratic trend is needed? or any other pattern is more approriate? justify your answer.
Model the covariance: fit both random intercept and random slope model and determine which one fits the data better.
Answer:
association between length and gender
toenail |>filter(treatment =="Terbinafine") |>ggplot(aes(x = month, y = length, group = id, color = gender)) +geom_line() +facet_wrap(~gender) +labs(x ="Month", y ="Length (mm)")
It seems that the length of the unafflicted part of the nail increases over time for both gender groups when treatment is terbinafine. The length of the unafflicted part of the nail for female seems to be a little higher than the male.
From the plots above, it seems that the linear trend model is adequate. The length of the unafflicted part of the nail seems to increase linearly over time for both genders. For the analysis part, we only include gender and month and control for health and treatment. We can also use the Kenward-Roger adjusted F-test to test the fixed effects to see if the linear trend model is adequate.
The p-value is 0.5956, so the quadratic trend model is not significantly better than the linear trend model. So the linear trend model is adequate. So the association between the pattern of change of nail lengths and gender is linear. Next we test the interaction between treatment and month.
The p-value is 0.3564 so the interaction between month and gender is not significant. We do not need to include interaction term in the model. We then consider the linear mixed model with random intercepts and random slopes: \[
\text{length}_{ij} = \mu + \beta_1 \text{month}_{i} + \beta_2 \text{gender}_{j} + b_{0j} + b_{1j} \text{month}_{i} + \epsilon_{ij},
\] where \(i\) indexes month and \(j\) indexes individual. The random intercepts and slopes are iid \[
\begin{pmatrix}
b_{0,j} \\
b_{1,j}
\end{pmatrix} \sim N(\mathbf{0}, \boldsymbol{\Sigma}).
\] and the noise terms \(\epsilon_{ij}\) are iid \(N(0, \sigma^2)\).
sig01 is the variance of the random intercepts, sig02 is the covariance between the random intercepts and slopes, and sig03 is the variance of the random slopes. Both random intercepts and slopes are significant because their intervals do not include 0. The male group is not significantly different from the female group because the confidence interval includes 0.
association between length and health club frequency
toenail |>filter(treatment =="Terbinafine") |>ggplot(aes(x = month, y = length, group = id, color = health)) +geom_line() +facet_wrap(~health) +labs(x ="Month", y ="Length (mm)")
It seems that the length of the unafflicted part of the nail increases over time for both gender groups when treatment is terbinafine. The length of the unafflicted part of the nail for low health club frequency group seems to be higher than the high health club frequency group.
From the plots above, it seems that the linear trend model is adequate. The length of the unafflicted part of the nail seems to increase linearly over time for both health frequencies. For the analysis part, we only include health and month and control for gender and treatment. We can also use the Kenward-Roger adjusted F-test to test the fixed effects to see if the linear trend model is adequate.
mmod <-lmer(length ~ month + health + (month | id), data =filter(toenail, treatment =="Terbinafine"), REML =TRUE)mmodr <-lmer(length ~poly(month, 2) + health + (month | id), data =filter(toenail, treatment =="Terbinafine"), REML =TRUE)KRmodcomp(mmod, mmodr)
large : length ~ poly(month, 2) + health + (month | id)
small : length ~ month + health + (month | id)
stat ndf ddf F.scaling p.value
Ftest 0.2822 1.0000 399.0000 1 0.5956
The p-value is 0.5956, so the quadratic trend model is not significantly better than the linear trend model. So the linear trend model is adequate. So the association between the pattern of change of nail lengths and health club frequency is linear. Next we test the interaction between treatment and month.
mmod <-lmer(length ~ month + health + (month | id), data =filter(toenail, treatment =="Terbinafine"), REML =TRUE)mmodr <-lmer(length ~ month * health + (month | id), data =filter(toenail, treatment =="Terbinafine"), REML =TRUE)KRmodcomp(mmod, mmodr)
large : length ~ month * health + (month | id)
small : length ~ month + health + (month | id)
stat ndf ddf F.scaling p.value
Ftest 0.0066 1.0000 98.0000 1 0.9356
The p-value is 0.9356 so the interaction between month and gender is not significant. We do not need to include interaction term in the model. We then consider the linear mixed model with random intercepts and random slopes: \[
\text{length}_{ij} = \mu + \beta_1 \text{month}_{i} + \beta_2 \text{health}_{j} + b_{0j} + b_{1j} \text{month}_{i} + \epsilon_{ij},
\] where \(i\) indexes month and \(j\) indexes individual. The random intercepts and slopes are iid \[
\begin{pmatrix}
b_{0,j} \\
b_{1,j}
\end{pmatrix} \sim N(\mathbf{0}, \boldsymbol{\Sigma}).
\] and the noise terms \(\epsilon_{ij}\) are iid \(N(0, \sigma^2)\).
library(lme4)mmod <-lmer(length ~ month + health + (month | id), data =filter(toenail, treatment =="Terbinafine"))summary(mmod)
Linear mixed model fit by REML ['lmerMod']
Formula: length ~ month + health + (month | id)
Data: filter(toenail, treatment == "Terbinafine")
REML criterion at convergence: 2110.2
Scaled residuals:
Min 1Q Median 3Q Max
-2.69831 -0.51211 0.03935 0.49881 2.83201
Random effects:
Groups Name Variance Std.Dev. Corr
id (Intercept) 1.89097 1.3751
month 0.04666 0.2160 0.17
Residual 0.93674 0.9679
Number of obs: 600, groups: id, 100
Fixed effects:
Estimate Std. Error t value
(Intercept) 6.14331 0.25810 23.802
month 0.27149 0.02371 11.450
healthHigh -1.19692 0.31526 -3.797
Correlation of Fixed Effects:
(Intr) month
month 0.020
healthHigh -0.818 0.000
sig01 is the variance of the random intercepts, sig02 is the covariance between the random intercepts and slopes, and sig03 is the variance of the random slopes. Both random intercepts and slopes are significant because their intervals do not include 0. The high frequency group is significantly different from the low frequency group because the confidence interval includes 0.
In answering these scientific questions of interest, clearly write out the analytic models you consider for answering these questions (as detailed in the sub-questions). Clearly outline your decision making process for how you selected your final models. Fit your chosen final models and report to the project investigators on the stated scientific questions of interest.
Answer: From the Q2.1, we see that the linear trend model is adequate and we need to include the interaction term between treatment and month. From the t-value of the linear mixed model, we notice month and the interaction term is significant but the main effect of treatment is not significant. However, in order to maintain the hierarchically well-formulated principle, we include the treatment in the model. Also, by checking the confidence interval of random intercepts and slopes, we see that both are significant.
From the Q2.2, we see that both month-gender and month-health models are linear but two interactions are not significant. Also, from the t-value of linear mixed models, in month-gender model, gender is not significant, while in month-health model, month and health` are significant. By checking the confidence interval of random intercepts and slopes in two models, we see that both are significant.
So we consider the linear mixed model with random intercepts and random slopes. The final model is \[
\text{length}_{ij} = \mu + \beta_1 \text{month}_{i} + \beta_2 \text{treatment}_{j} + \beta_3 \text{month}_{i} \times \text{treatment}_{j}+\beta_4 \text{health}_{j} + b_{0j} + b_{1j} \text{month}_{i} + \epsilon_{ij},
\] where \(i\) indexes month and \(j\) indexes individual. The random intercepts and slopes are iid \[
\begin{pmatrix}
b_{0,j} \\
b_{1,j}
\end{pmatrix} \sim N(\mathbf{0}, \boldsymbol{\Sigma}).
\] and the noise terms \(\epsilon_{ij}\) are iid \(N(0, \sigma^2)\).
mmod <-lmer(length ~ month * treatment + health + (month | id), data = toenail)summary(mmod)
Linear mixed model fit by REML ['lmerMod']
Formula: length ~ month * treatment + health + (month | id)
Data: toenail
REML criterion at convergence: 4340.8
Scaled residuals:
Min 1Q Median 3Q Max
-2.58679 -0.51413 0.02595 0.47767 2.92726
Random effects:
Groups Name Variance Std.Dev. Corr
id (Intercept) 2.0600 1.4353
month 0.1104 0.3322 -0.45
Residual 0.9412 0.9701
Number of obs: 1200, groups: id, 200
Fixed effects:
Estimate Std. Error t value
(Intercept) 6.06663 0.20557 29.511
month 0.27149 0.03464 7.838
treatmentItraconazole 0.08987 0.21775 0.413
healthHigh -1.08247 0.20331 -5.324
month:treatmentItraconazole 0.14933 0.04899 3.048
Correlation of Fixed Effects:
(Intr) month trtmnI hlthHg
month -0.354
trtmntItrcn -0.536 0.334
healthHigh -0.663 0.000 0.009
mnth:trtmnI 0.250 -0.707 -0.472 0.000
We can report that the length of the unafflicted part of the nail increases linearly over time for both health frequencies and for both treatments. The high frequency group is significantly different from the low frequency group. It seems if we want to increase the length of the unafflicted part of the nail, we should reduce the health club frequency and use the itraconazole treatment.
3 Q3 (25 pts). GEE and GLMM
The Skin Cancer Prevention Study, a randomized, double-blind, placebo-controlled clinical trial, was designed to test the effectiveness of beta-carotene in the prevention of non-melanoma skin cancer in high-risk subjects. A total of 1,683 subjects were randomized to either placebo or 50mg of beta-carotene per day and were followed for up to 5 years. Subjects were examined once per year and biopsied if a cancer was suspected to determine the number of new cancers per year. The outcome variable, \(Y\), is a count of the number of new skin cancers per year. You may assume that the counts of new skin cancers, \(Y\), are from exact one-year periods (so that no offset term is needed).
Selected data from the study are in the dataset called skin.txt and is available here. Each row of the dataset contains the following 9 variables: ID, Center, Age, Skin, Gender, Exposure, \(Y\), Treatment, Year. These variables take values as follows:
Variable
ID:
Subject identifier number
Center:
Identifier number for center of enrollment
Age:
Subject’s age in years at randomization
Skin:
Skin type (1=burns; 0 otherwise) [evaluated at randomization and doesn’t change with time]
Gender:
1=male; 0=female
Exposure:
Count of number of previous skin cancers [prior to randomization]
\(Y\):
Count of number of new skin cancers in the Year of follow-up
Treatment:
1=beta-carotene; 0=placebo
Year:
Year of follow-up after starting randomized treatment
Your collaborator is interested in assessing the effect of treatment on the incidence of new skin cancers over time. As the statistician on the project, provide an analysis of the data that addresses this question. Specifically, the investigator at Center=1 is interested in characterizing the distribution of risk among subjects at her center. In the following, only include the subset of subjects with Center=1 in the analysis.
ID Center Age Skin Gender Exposure Y Treatment Year
1 100034 1 51 burns male 4 0 placebo 1
2 100034 1 51 burns male 4 1 placebo 2
3 100034 1 51 burns male 4 1 placebo 3
4 100034 1 51 burns male 4 1 placebo 4
5 100034 1 51 burns male 4 0 placebo 5
6 100045 1 68 burns female 2 0 placebo 1
Provide an algebraic definition for a generalized linear marginal model (GEE) in which the only effects are for the intercept and Year (as a continuous variable). Fit this model and provide a table which includes the estimates of the parameters in your model.
Answer: The GEE is defined as follows: Let \(\mathbf{Y}_i\) be the response vector of \(i\)-th individual. \[
\begin{eqnarray*}
\mathbb{E} \mathbf{Y}_i &=& \boldsymbol{\mu}_i \\
g(\boldsymbol{\mu}_i) &=& \mathbf{X}_i \boldsymbol{\beta} = \beta_0 + \beta_1 \text{Year}_i \\
\mathbf{V}_i &=& \text{Var}(\mathbf{Y}_i) = \phi \mathbf{A}_i^{1/2} \mathbf{R}_i(\boldsymbol{\alpha}) \mathbf{A}_i^{1/2},
\end{eqnarray*}
\] where \(\mathbf{A}_i = \text{diag}(a(\boldsymbol{\mu}))\) captures the individual variances and \(\mathbf{R}_i(\boldsymbol{\alpha})\) is a working correlation matrix. Given estimates of \(\phi\) and \(\boldsymbol{\alpha}\), we solve the estimation equation\[
\sum_i [D_{\boldsymbol{\beta}} \boldsymbol{\mu}_i(\boldsymbol{\beta})]^T \mathbf{V}_i^{-1}(\mathbf{Y}_i - \boldsymbol{\mu}_i) = \mathbf{0}.
\]
library(geepack)library(gtsummary)library(kableExtra)mmod <-geeglm(Y ~ Year, id = ID, data = skin, family = poisson, scale.fix =TRUE, corstr ="exchangeable")summary(mmod)
Call:
geeglm(formula = Y ~ Year, family = poisson, data = skin, id = ID,
corstr = "exchangeable", scale.fix = TRUE)
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -1.5411 0.1674 84.739 <2e-16 ***
Year -0.1066 0.0486 4.807 0.0283 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation structure = exchangeable
Scale is fixed.
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha 0.2306 0.08122
Number of clusters: 422 Maximum cluster size: 5
Provide an algebraic definition for a generalized linear mixed model (GLMM) in which the only fixed effects are for the intercept and Year (as a continuous variable), and the only random effect is the intercept. What is being assumed about how the distribution of risk among subjects changes with time?
Answer: The GLMM is defined as follows: In GLM, the distribution of \(Y_i\) is from the exponential family of distributions of form \[
f(y_i \mid \theta_i, \phi) = \exp \left[ \frac{y \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right].
\] If we use the canonical link, then \[
\theta_i = g(\mu_i) = \eta_i,
\] where \(\mu_i = \mathbb{E} Y_i\). Now let \[
\eta_i = \mathbf{x}_i^T \boldsymbol{\beta} + \mathbf{z}_i^T \boldsymbol{\gamma}_i,
\] where \(\boldsymbol{\beta}\) is the fixed effects and \(\boldsymbol{\gamma}\) is the random effects with density \(h(\gamma \mid \boldsymbol{\Sigma})\). (Typically we assume multivariate normal with mean 0 and covariance \(\boldsymbol{\Sigma}\).) Then the likelihood is \[
L(\boldsymbol{\beta}, \phi, \boldsymbol{\Sigma}) = \prod_{i=1}^n \int f(y_i \mid \boldsymbol{\beta}, \phi, \boldsymbol{\gamma}) h(\boldsymbol{\gamma} \mid \boldsymbol{\Sigma}) \, d \boldsymbol{\gamma}
\] and the log-likelihood is \[
\ell(\boldsymbol{\beta}, \phi, \boldsymbol{\Sigma}) = \sum_{i=1}^n \log \int f(y_i \mid \boldsymbol{\beta}, \phi, \boldsymbol{\gamma}) h(\boldsymbol{\gamma} \mid \boldsymbol{\Sigma}) \, d \boldsymbol{\gamma}.
\] This model, combining GLM and mixed effects model, is called the generalized linear mixed effects model (GLMM). In this question, \[
Y_{ij} \sim \text{Poisson}(\mu_{ij})\\
\quad \log(\mu_{ij}) = \beta_0 + \beta_1 \text{Year}_i + \gamma_{0j}\\
\gamma_{0j} \sim \text{Normal}(0, \sigma^2)
\] It is assumed that number of new skin cancers in the Year of follow-up follows a Poisson distribution with mean \(\mu_{ij}\), where \(\mu_{ij}\) is a function of the fixed effects \(\beta_0\) and \(\beta_1\) and the random effect \(\gamma_{0j}\). \(\mu_{ij}\) is linear in Year and the random effect \(\gamma_{0j}\) is assumed to be normally distributed with mean 0 and variance \(\sigma^2\).
Fit your chosen GLMM and provide a table from your output which includes the estimates for the parameters in your GLMM, and provide careful interpretation of the Year term.
Answer:
mmod1 <-glmer(Y ~ Year + (1| ID), data = skin, family = poisson)summary(mmod1)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: poisson ( log )
Formula: Y ~ Year + (1 | ID)
Data: skin
AIC BIC logLik deviance df.resid
1641.7 1658.3 -817.8 1635.7 1880
Scaled residuals:
Min 1Q Median 3Q Max
-1.477 -0.260 -0.234 -0.210 3.895
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 1.57 1.25
Number of obs: 1883, groups: ID, 422
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.2988 0.1756 -13.09 <2e-16 ***
Year -0.1083 0.0434 -2.49 0.013 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
Year -0.647
tibble(Intercept =fixef(mmod1)[1], Year =fixef(mmod1)[2]) |>kable("html") |>kable_styling("striped", full_width = F)
Intercept
Year
-2.299
-0.1083
Coefficient of Year = -0.108, which means for every additional year of follow-up after the beginning of the experiment at center 1, the estimated average number of new skin cancers decreases by $ 10.3% = 1-e^{-0.1083}$.
Are the estimates for the fixed intercept terms the same or different in the GLMM compared with the marginal model fitted in question (1)? Why are they the same or different?
Answer: The estimates for the fixed intercept terms are different in the GLMM compared with the marginal model fitted in question (1). The fixed intercept term in the GLMM is -2.2988, while the fixed intercept term in the marginal model is -1.541.
The fixed intercept term in GEE represents the average response across the population, while in GLMM, it represents the response within a typical cluster after adjusting for the random effects. To be specific, the fixed intercept term in the GLMM is the log of the expected number of new skin cancers at center 1 during the first year of follow-up, whereas the fixed intercept term in the marginal model is the log of the expected number of new skin cancers at center 1 during the first year of follow-up averaged over all centers.
Use the parameter estimates from your GLMM and your model definition to characterize the distribution of expected counts of new skin cancers among subjects at center 1 during their first year of follow-up.
This question is adapted from Exercise 11.2 of ELMR (p251). Read the documentation of the dataset hprice in Faraway package before working on this problem.
Make a plot of the data on a single panel to show how housing prices increase by year. Describe what can be seen in the plot.
Answer:
data(hprice)ggplot(hprice, aes(x = time, y = narsp, group = msa)) +geom_line() +labs(x ="Year", y ="Natural Log of House Price")
From the plot, we can see that most of MSA’s house prices increase over time. Most of MSA’s natural log average house prices are between 4 to 4.5 thousand dollars and these MSA has almost the same slope. Several MSA’s natural log average house prices are around 5 thousand dollars and they have almost the same slope.
Fit a linear model with the (log) house price as the response and all other variables (except msa) as fixed effect predictors. Which terms are statistically significant? Discuss the coefficient for time.
Answer:
mod <-lm(narsp ~ . - msa, data = hprice)summary(mod)
Call:
lm(formula = narsp ~ . - msa, data = hprice)
Residuals:
Min 1Q Median 3Q Max
-0.3139 -0.1081 -0.0152 0.0855 0.5559
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.67e+00 8.46e-02 31.61 < 2e-16 ***
ypc 7.03e-05 4.36e-06 16.13 < 2e-16 ***
perypc -1.37e-02 5.07e-03 -2.70 0.00722 **
regtest 2.95e-02 3.10e-03 9.52 < 2e-16 ***
rcdum1 1.49e-01 3.24e-02 4.60 6.1e-06 ***
ajwtr1 3.59e-02 2.00e-02 1.80 0.07348 .
time -1.77e-02 5.13e-03 -3.45 0.00065 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.165 on 317 degrees of freedom
Multiple R-squared: 0.757, Adjusted R-squared: 0.753
F-statistic: 165 on 6 and 317 DF, p-value: <2e-16
Intercept, ypc, perypc, regtest, rcdum, and time are statistically significant. Coefficient for time = -1.77e-02. This means that for every additional year, the average house price in thousands of dollars is estimated to decrease by \(1-e^{-0.0177} = 1.75\%\), controlling for all other variables in the model.
Make a plot that shows how per-capita income changes over time. What is the nature of the increase? Make a similar plot to show how income growth changes over time. Comment on the plot.
Answer:
ggplot(hprice, aes(x = time, y = ypc, group = msa)) +geom_line() +labs(x ="Year", y ="Per Capita Income")
From the plot, all the MSA’s per capita income increases over time and it seems the increase is linear.
ggplot(hprice, aes(x = time, y = perypc, group = msa)) +geom_line() +labs(x ="Year", y ="Per Capita Income")
From the plot, the percentage growth in per capita income is not linear but they have a similar pattern. We can see all the percentage growth in per capita income decreases in 1987 and then increases to a peak in 1989. After 1989, most of the percentage growth per capita income decreases, with some decreasing significantly and some slightly. Some percentage growth per capita income increases slightly after 1989. Finally, all percentage growth per capita income converge to around 4 percentage growth.
Create a new variable that is the per-capita income for the first time period for each MSA. Refit the same linear model but now using the initial income and not the income as it changes over time. Compare the two models.
Call:
lm(formula = narsp ~ . - msa - ypc + ypc_initial, data = hprice)
Residuals:
Min 1Q Median 3Q Max
-0.3728 -0.1055 -0.0154 0.0790 0.5391
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.30e+00 9.67e-02 23.80 < 2e-16 ***
perypc -8.13e-03 4.97e-03 -1.64 0.10
regtest 3.01e-02 3.04e-03 9.93 < 2e-16 ***
rcdum1 1.51e-01 3.16e-02 4.78 2.7e-06 ***
ajwtr1 3.85e-02 1.96e-02 1.97 0.05 .
time 3.71e-02 3.79e-03 9.80 < 2e-16 ***
ypc_initial 8.87e-05 5.28e-06 16.81 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.162 on 317 degrees of freedom
Multiple R-squared: 0.766, Adjusted R-squared: 0.762
F-statistic: 173 on 6 and 317 DF, p-value: <2e-16
summary(mod)
Call:
lm(formula = narsp ~ . - msa, data = hprice)
Residuals:
Min 1Q Median 3Q Max
-0.3139 -0.1081 -0.0152 0.0855 0.5559
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.67e+00 8.46e-02 31.61 < 2e-16 ***
ypc 7.03e-05 4.36e-06 16.13 < 2e-16 ***
perypc -1.37e-02 5.07e-03 -2.70 0.00722 **
regtest 2.95e-02 3.10e-03 9.52 < 2e-16 ***
rcdum1 1.49e-01 3.24e-02 4.60 6.1e-06 ***
ajwtr1 3.59e-02 2.00e-02 1.80 0.07348 .
time -1.77e-02 5.13e-03 -3.45 0.00065 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.165 on 317 degrees of freedom
Multiple R-squared: 0.757, Adjusted R-squared: 0.753
F-statistic: 165 on 6 and 317 DF, p-value: <2e-16
In this new model, the first time per-capita income ypc_initial is also statistically significant. The coefficient of time becomes positive while in the previous model it was negative. Now the positive coefficient is consistent with the plot in Q3. The adjusted R-squared value of the new model is 0.762, which is slightly higher than the previous model’s adjusted R-squared value of 0.753.
Fit a mixed effects model that has a random intercept for each MSA. Why might this be reasonable? The rest of the model should have the same structure as in the previous question. Make a numerical interpretation of the coefficient of time in your model. Explain the difference between REML and MLE methods.
Warning: Some predictor variables are on very different scales: consider
rescaling
summary(mmod)
Linear mixed model fit by REML ['lmerMod']
Formula: narsp ~ . - msa - ypc + ypc_initial + (1 | msa)
Data: hprice
REML criterion at convergence: -581.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.385 -0.594 -0.025 0.574 2.858
Random effects:
Groups Name Variance Std.Dev.
msa (Intercept) 0.02361 0.1537
Residual 0.00545 0.0738
Number of obs: 324, groups: msa, 36
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.31e+00 2.63e-01 8.78
perypc -9.15e-03 2.30e-03 -3.98
regtest 3.02e-02 8.75e-03 3.45
rcdum1 1.51e-01 9.11e-02 1.66
ajwtr1 3.85e-02 5.65e-02 0.68
time 3.68e-02 1.73e-03 21.28
ypc_initial 8.86e-05 1.52e-05 5.83
Correlation of Fixed Effects:
(Intr) perypc regtst rcdum1 ajwtr1 time
perypc -0.049
regtest -0.504 -0.005
rcdum1 0.478 -0.004 -0.310
ajwtr1 0.110 0.001 -0.040 -0.178
time -0.047 0.395 -0.002 -0.002 0.000
ypc_initial -0.751 0.002 -0.174 -0.332 -0.182 0.001
fit warnings:
Some predictor variables are on very different scales: consider rescaling
Coefficient of time = 3.68e-02. This means that for every additional year, the average house price in thousands of dollars is estimated to increase by \(e^{0.0368}-1 = 3.75\%\), controlling for all other variables in the model.
The maximum likelihood estimation method (MLE) is biased for the estimation of variance components. The restricted maximum likelihood estimation method (REML) tries to reduce bias in the variance components. The variance component parameters are estimated by MLE using transformed data and then fixed effects are estimated using general least squares. The REML estimate does not depend on the choice of data transformation in the model.
Fit a model that omits the adjacent to water and rent control predictors. Test whether this reduction in the model can be supported.
Warning: Some predictor variables are on very different scales: consider
rescaling
summary(mmod2)
Linear mixed model fit by REML ['lmerMod']
Formula: narsp ~ . - msa - ypc + ypc_initial + (1 | msa) - ajwtr - rcdum
Data: hprice
REML criterion at convergence: -584.5
Scaled residuals:
Min 1Q Median 3Q Max
-2.3793 -0.5958 -0.0131 0.5801 2.8500
Random effects:
Groups Name Variance Std.Dev.
msa (Intercept) 0.02489 0.1578
Residual 0.00545 0.0738
Number of obs: 324, groups: msa, 36
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.05e+00 2.31e-01 8.87
perypc -9.13e-03 2.30e-03 -3.97
regtest 3.55e-02 8.49e-03 4.18
time 3.68e-02 1.73e-03 21.28
ypc_initial 1.01e-04 1.42e-05 7.08
Correlation of Fixed Effects:
(Intr) perypc regtst time
perypc -0.053
regtest -0.416 -0.006
time -0.053 0.395 -0.002
ypc_initial -0.698 0.000 -0.349 0.000
fit warnings:
Some predictor variables are on very different scales: consider rescaling
The p-value is 0.17, which is greater than 0.05. Therefore, we fail to reject the null hypothesis that the reduced model is significantly different from the full model. So the reduction in the model can be supported.
It is possible that the increase in prices may not be linear in year. Fit an additive mixed model where smooth is added to year. Make a plot to show how prices have increased over time.
Answer:
library(mgcv)mmod3 <-gamm(narsp ~ perypc + regtest + ypc_initial +s(time, k =9), random =list(msa =~1), data = hprice) summary(mmod3$gam)
Coefficient of ypc_initial = 1.01e-04. This means that for every 1,000 dollars increase in the initial per capita income, the average house price in thousands of dollars is estimated to increase by \(e^{0.101}-1 = 10.6\%\), controlling for all other variables in the model.
Coefficient of perypc = -1.12e-02. This means that for every \(1\%\) increase in percentage growth in per capita income, the average house price in thousands of dollars is estimated to decrease by \(1-e^{-0.0112} = 1.11\%\), controlling for all other variables in the model.
Coefficient of regtest = 3.56e-02. This means that for every one index increase in the regulation test, the average house price in thousands of dollars is estimated to increase by \(e^{0.0356}-1 = 3.62\%\), controlling for all other variables in the model.
This problem is meant to offer another chance to demonstrate understanding of some of the material on the mid-term. If you choose to do this problem and your score is higher than your mid-term grade, then your mid-term grade will be reweighted to be New Midterm Grade = .8*Old Midterm Grade + .2*Extra Credit Problem